{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Linear regression using scikit-learn\n",
    "\n",
    "In the previous notebook, we presented the parametrization of a linear model.\n",
    "During the exercise, you saw that varying parameters gives different models\n",
    "that may fit better or worse the data. To evaluate quantitatively this\n",
    "goodness of fit, you implemented a so-called metric.\n",
    "\n",
    "When doing machine learning, one is interested in selecting the model which\n",
    "minimizes the error on the data available the most. From the previous\n",
    "exercise, we could implement a brute-force approach, varying the weights and\n",
    "intercept and select the model with the lowest error.\n",
    "\n",
    "Hopefully, this problem of finding the best parameters values (i.e. that\n",
    "result in the lowest error) can be solved without the need to check every\n",
    "potential parameter combination. Indeed, this problem has a closed-form\n",
    "solution: the best parameter values can be found by solving an equation. This\n",
    "avoids the need for brute-force search. This strategy is implemented in\n",
    "scikit-learn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "penguins = pd.read_csv(\"../datasets/penguins_regression.csv\")\n",
    "feature_name = \"Flipper Length (mm)\"\n",
    "target_name = \"Body Mass (g)\"\n",
    "data, target = penguins[[feature_name]], penguins[target_name]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "linear_regression = LinearRegression()\n",
    "linear_regression.fit(data, target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The instance `linear_regression` stores the parameter values in the attributes\n",
    "`coef_` and `intercept_`. We can check what the optimal model found is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weight_flipper_length = linear_regression.coef_[0]\n",
    "weight_flipper_length"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "intercept_body_mass = linear_regression.intercept_\n",
    "intercept_body_mass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the weight and intercept to plot the model found using the\n",
    "scikit-learn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "flipper_length_range = np.linspace(data.min(), data.max(), num=300)\n",
    "predicted_body_mass = (\n",
    "    weight_flipper_length * flipper_length_range + intercept_body_mass\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "sns.scatterplot(x=data[feature_name], y=target, color=\"black\", alpha=0.5)\n",
    "plt.plot(flipper_length_range, predicted_body_mass)\n",
    "_ = plt.title(\"Model using LinearRegression from scikit-learn\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the solution of the previous exercise, we implemented a function to compute\n",
    "the goodness of fit of a model. Indeed, we mentioned two metrics: (i) the\n",
    "[mean squared\n",
    "error](https://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-error)\n",
    "and (ii) the [mean absolute\n",
    "error](https://scikit-learn.org/stable/modules/model_evaluation.html#mean-absolute-error).\n",
    "Let's see how to use the implementations from scikit-learn in the following.\n",
    "\n",
    "We can first compute the mean squared error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "inferred_body_mass = linear_regression.predict(data)\n",
    "model_error = mean_squared_error(target, inferred_body_mass)\n",
    "print(f\"The mean squared error of the optimal model is {model_error:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A linear regression model minimizes the mean squared error on the training\n",
    "set. This means that the parameters obtained after the fit (i.e. `coef_` and\n",
    "`intercept_`) are the optimal parameters that minimizes the mean squared\n",
    "error. In other words, any other choice of parameters would yield a model with\n",
    "a higher mean squared error on the training set.\n",
    "\n",
    "However, the mean squared error is difficult to interpret. The mean absolute\n",
    "error is more intuitive since it provides an error in the same unit as the one\n",
    "of the target."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_absolute_error\n",
    "\n",
    "model_error = mean_absolute_error(target, inferred_body_mass)\n",
    "print(f\"The mean absolute error of the optimal model is {model_error:.2f} g\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A mean absolute error of 313 means that in average, our model make an error of\n",
    "\u00b1 313 grams when predicting the body mass of a penguin given its flipper\n",
    "length."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook, you saw how to train a linear regression model using\n",
    "scikit-learn."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}